Abstract

Interacting with computers is a ubiquitous activity for
millions of people. Repetitive or specialized tasks often
require creation of small, often one-off, programs. End-users
struggle with learning and using the myriad of domain-specific
languages (DSLs) to effectively accomplish these tasks.

We present a general framework for constructing program
synthesizers that take natural language (NL) inputs and
produce expressions in a target DSL. The framework takes as
input a DSL definition and training data consisting of NL/DSL
pairs. From these it constructs a synthesizer by learning
optimal weights and classifiers (using NLP features) that rank
the outputs of a keyword-programming based translation.
We applied our framework to three domains: repetitive text
editing, an intelligent tutoring system, and flight information
queries. On 1200+ English descriptions, the respective
synthesizers rank the desired program in the top-1 and top-3
for 80% and 90% of the descriptions, respectively.

1. Introduction 

Program synthesis is the task of automatically synthesizing
a program in some underlying domain-specific language (DSL)
from a given specification [10]. Traditional program
synthesis, that of synthesizing programs from complete
specifications [16, 33, 46, 47], has not yet seen wide
adoption, as it is often difficult to automatically check that
a specification is satisfied by the synthesized program. More
significantly, these specifications are difficult to write.

Recent work has experimented with another class of (possibly
incomplete) specifications, namely examples [7, 11, 12,
18, 29]. Programming by Example (PBE) systems have seen
much wider adoption, thanks to the ease of providing such
a specification. However, they are not ideal for specifying
certain kinds of operations such as filter or reduce. In
particular, conditional operations generally require examples
exercising both the true and false paths, leading to rapid
growth in the number of examples needed. The classic L*
algorithm [3], a PBE system for describing a regular language,
has the well-known drawback of requiring too many examples.
Even state-of-the-art PBE systems like FlashFill [13]
are limited in their handling of conditionals. Moreover, in
such tasks, PBE requires analyzing the contiguous chunk of
input data on which edits have been performed. This leads
to scalability issues on text files, which are much larger than
strings. (FlashFill is restricted to work with strings with at
most 256 characters.) Describing tasks using examples has
other drawbacks: for some domains like air travel information
systems (ATIS), it is not even clear what an “example”
is. It turns out that operations like filter and reduce, and
their compositions, can be specified much more easily and
concisely using natural language (NL).

In this paper, we address the problem of synthesizing
programs in an underlying DSL from NL. NL is inherently
imprecise; hence, it may not be possible to guarantee the
correctness of the synthesized program. Instead, we aim to
generate a ranked set of programs and let the user select
one of those programs by inspecting either the source
code of the program or the result of executing that program
on some test inputs. This user interaction is similar to what
is employed in any PBE technology like FlashFill. The reader
may also liken this process to how users search for and select
their desired results in a search engine. Further, similar to a
search engine, the synthesis algorithm in this paper is able to
consistently produce and rank the desired result program in
(a) Grammar

S := Command | SEQ(Command, Command)
Command := ReplaceCmd | RemoveCmd | InsertCmd | PrintCmd
ReplaceCmd := REPLACE(SelectStr, NewString, IterScope)
RemoveCmd := REMOVE(SelectStr, IterScope)
InsertCmd := INSERT(PString, Position, IterScope)
PrintCmd := PRINT(SelectStr, IterScope)
SelectStr := (Token, BCond, Occurrence)
IterScope := (Scope, BCond, Occurrence) | DOCUMENT
Token := PString | WORDTOK | NUMBERTOK | ...
BCond := AtomicCond | NOT(AtomicCond) | AND(AtomicCond, AtomicCond)
AtomicCond := STARTSWITH(Token) | CONTAINS(Token) | CommonCond
CommonCond := BETWEEN(Token, AnotherToken) | AFTER(Token) | ...
PString := ConstantString | WORD(ConstantString) | ...
Occurrence := ALL | AtomicOccurrence | ...
AtomicOccurrence := IntSetByAnd | FirstFew(Integer) | ...
IntSetByAnd := Integer | INTSET(IntegerSet)
Scope := LINESCOPE | WORDSCOPE
AnotherToken := TO(Token)
Position := START | END | ...
NewString := BY(PString)
ConstantString := String
String := <STRING>
Integer := <INTEGER>


(b) Sample Benchmarks 

1. Remove the first word of lines which start with number. 

2. Replace “&” with “&&” unless it is inside “[” and 

3. Add “$” at the beginning of those lines that do not 
already start with 

4. Add “..???” at the last of every 2nd statement. 

5. In every line, delete the text after 

6. Remove 1st from every line. 

7. Add the suffix “_IDM” to the word right after “idiom:”. 

8. Delete all but the 1st occurrence of “Cook”. 

9. Delete the word “the” wherever it comes after “all”. 

10. Print data between “<url>” and “</url>”. 


(c) Variations in NL for description of the same task. 

1. Prepend the line containing “P.O. BOX” with 

2. Add a “*” at the beginning of the line in which the
string “P.O. BOX” occurs

3. Put a “*” before each line that has “P.O. BOX” in it 

4. Put “*” in front of lines with “P.O. Box” as a substring 

5. Insert “*” at the start of every line which has “P.O. 
BOX” word 


Table 1: Grammar and sample benchmarks for the Text Editing domain. 


the top spot over 80% of the time, or in the top 3 spots over
90% of the time, in our benchmarks. To give users confidence
in the program they choose, we show the translation of the
code into disambiguated English and/or run it to show
the result as a preview.

Unlike most of the existing synthesis techniques, which
specialize to a specific DSL, our methodology can be applied
to a variety of DSLs. Our methodology requires two inputs
from the synthesis designer: (i) the DSL definition, and (ii)
training data consisting of example pairs of English
sentences and corresponding intended programs in the DSL. A
training phase infers a dictionary relation over pairs of
English words and DSL terminals (in a semi-automated,
interactive manner), and optimal weights/classifiers (in a
completely automated manner) for use by the generic synthesis
algorithm. Our approach can be seen as a meta-synthesis
framework for constructing NL-to-DSL synthesizers.

The generic synthesis algorithm (Alg. 1) takes as input
an English sentence and generates a ranked set of likely
programs. First, it uses a bag algorithm (Alg. 2) to efficiently
compute the set of all consistent DSL programs whose
terminals are related to the words that occur in the sentence.
For this, it uses a dictionary (learned during the training
phase) that is a relation over English words and DSL
terminals. Then, it ranks these programs based on a set of
scoring functions (§4.2). These functions are inspired by our
view of the Abstract Syntax Tree (AST) of a program as involving
two constituents: the set of terminals in the program, and the
tree structure between those terminals. We use a weighted
linear combination of 3 scores to determine the rank of each
program: (i) a coverage score that captures the intuition that
results that ignore many words in the user input are unlikely
to be correct, (ii) a mapping score that captures the intuition
that English words can have multiple meanings and intended
actions with respect to the DSL, but we prefer the more
probable interpretations, and (iii) a structure score that uses
the insight that both natural language and programming
languages have common idiomatic structures [20], so we prefer
the more natural results.

The classifiers to compute these scores, as well as the
weights for combining the scores, are learned during the
training phase using off-the-shelf machine learning
algorithms. The novelty of our approach lies in the generation
of training data for classifier learning from the top-level
training data (Alg. 3 and 4), and in smoothing a discrete
scoring metric into a continuous and differentiable loss
function for effective learning of weights (§5.4).

This paper makes the following contributions: 

• We describe a meta-synthesis framework for constructing
NL-to-DSL synthesizers, consisting of a synthesis
algorithm (§4) for translating English sentences into
corresponding programs in the underlying DSL, and a training
phase for learning a dictionary and weights that are used
by the synthesis algorithm (§5). Our methodology can be
applied to new DSLs, and requires only the DSL definition
along with translation pair training data.

• We apply our generic framework to three different
domains, namely automating end-user data manipulation
(§2.1), generating problem descriptions in intelligent
tutoring systems (§2.2), and database querying (§2.3). In
cases where comparisons can be made with state-of-the-art
NLP based approaches, the results of the approach
presented in this paper are competitive.

• We gather an extensive corpus of data consisting of 1272
pairs of English descriptions and corresponding programs.
We use this data for evaluation in this paper, and
provide it as a resource for researchers in the community.
Of these, 535 English descriptions come from the
Air Travel Information System (ATIS) benchmark suite,
227 come from another corpus [34], while 510 English
descriptions were collected by us from various online
sources (including help forums and course materials),
textbooks, and user studies.

• We evaluate the effectiveness of our approach on 3
different DSLs (§6). The NL-to-DSL synthesizers produced
by our framework run in 1-2 seconds on average per
benchmark and produce a ranked set of candidate programs
with the correct result in the top-1/top-3 choices
for over 80%/90% of the benchmarks, respectively.

2. Motivating Scenarios 

We describe 3 different domains where an NL-to-DSL
synthesizer is useful: text editing (§2.1), automata
construction problems for intelligent tutoring (§2.2), and
answering queries for an air travel information system (§2.3).

2.1 Text Editing (End-User Programming) 

Through a study of help forums for Office suite applications
like Microsoft Excel and Word, we observed that users
frequently request help with repetitive text editing operations
such as insertion, deletion, replacement, or extraction in text
files. These operations (Table 1(b)) are more complicated
than simple search-and-replace of a constant string by
another in two ways. First, the string being searched for is
often not constant and instead requires regular expression
matching. Second, the editing is often conditional on the
surrounding context. Programming even such relatively simple
tasks requires the user to understand the syntax and semantics
of regular expressions, conditionals, and loops, which
are beyond the ability of most end-users.

This inspired us to design a command language for text
editing (a subset of the grammar is shown in Table 1(a))
that includes the key commands Insert, Remove, Print and
Replace. Each of these commands relies on an IterScope
expression that specifies the region (a set of lines, a set of
words, or the entire Word document) on which the text editing
operation is performed. The SelectStr production includes a
Token, which allows for limited wild-card matching (e.g., an
entire WORDTOK, NUMBERTOK, or a pattern specified by PString),
a Boolean condition BCond that acts as an additional (local)
filter on the matched value, and an Occurrence value
that performs an index based selection from the resultant
matches. Use of occurrence values like FirstFew(N)
(from AtomicOccurrence) when performing a Remove results
in the removal of only the first N items (here N is a positive
integer) that match the condition, while use of ALL will
instead result in all matches of the condition being removed.


1. Consider the set of all binary strings where the difference 
between the number of “0” and the number of “1” is even. 

2. The set of strings of “0” and “1” such that at least one of 
the last 10 positions is a “1”. 

3. the set of strings w such that the symbol at every odd 
position in w is “a”. 

4. Let L1 be the set of words w that contain an even number
of “a”, let L2 be the set of words w that end with “b”, let L3
= L1 intersect L2.

5. The set of strings over alphabet 0 to 9 such that the final 
digit has not appeared before. 


Table 2: Sample benchmarks for the Automata domain. 

The Boolean conditions BCond cover the standard range of
string matching predicates (CONTAINS, STARTSWITH, etc.)
and allow Boolean combinations of conditions (AND, NOT, etc.).
The CommonCond production, which specifies a position relative
to the string token(s) that occur after it (AFTER), before it
(BEFORE), or around it (BETWEEN), acts as another (global)
filter. Table 1(c) shows a sample of the variations that our
system can handle in the description of a task that is
expressible in our DSL.

Example 1. For the text editing task described in task 1 in
Table 1(b), our system produces the following translation:
REMOVE(SelectStr(WORDTOK, ALWAYS, INTEGER(1)),
    IterScope(LINESCOPE, STARTSWITH(NUMBERTOK), ALL))

Example 2. Given the English description for task 2 in
Table 1(b), our system produces the following translation:
REPLACE(SelectStr(STRING(&), NOT(BETWEEN(STRING([),
    TO(STRING(])))), ALL), BY(STRING(&&)), DOCUMENT)

Our belief is that once users are able to accomplish these 
types of smaller conditional and repetitive tasks, they can 
easily accomplish other more complex tasks by reducing 
them to a sequence of smaller tasks. 

2.2 Automata Theory (Intelligent Tutoring)

Results from formal methods research have been used in
many parts of intelligent tutoring systems [14], including
problem generation, solution generation, and especially
feedback generation, for a variety of subject domains
including geometry [17] and automata theory [1]. Each of
these domains involves a specialized DSL that is used by a
problem generation tool to create new problems, a solution
generation tool to produce solutions, and more significantly,
a feedback generation tool to provide feedback on student
solutions.

Consider the domain of automata constructions, where
students are asked to construct an automaton that accepts a
language whose description is provided in English (for some
examples, see Table 2). We designed a DSL based on the
description provided by Alur et al. [1] of the constructs
required to

1. I would like the time of your earliest flight in the morning
from Philadelphia to Washington on American Airlines.

2. I need information on a flight from San Francisco to Atlanta
that would stop in Fort Worth.

3. What is the earliest flight from Washington to Atlanta
leaving on Wednesday September fourth.

4. Okay we’re going from Washington to Denver first class
ticket I would like to know the cost of a first class ticket.

5. What ground transportation is there from the airport in
Atlanta to downtown.


Table 3: Sample benchmarks for the ATIS domain.

formally specify such languages. This DSL contains predicates
over strings, Boolean connectives, functions that return
positions of substrings, and universal/existential
quantification over string positions. As stated in [1], such a
language is used to generate feedback for students’ incorrect
attempts in two ways: (i) it is used by a solution generation
tool to generate correct solutions against which a student’s
attempts are graded, and (ii) it is also used to provide
feedback and generate problem variations consistent with a
student’s attempt. This feedback generation tool has been
deployed in the classroom and has been able to assign grades
and generate feedback in a meaningful way while being both
faster and more consistent than a human. Our synthesis
methodology can be used to automatically generate the
specifications needed by this system from natural language
descriptions.

Example 3. Specification 1 in Table 2 is translated as:
ISEVEN(DIFF(COUNT(STRING(0)), COUNT(STRING(1))))

Example 4. Specification 2 in Table 2 is translated as:
EXISTSINT(LASTFEW(INTEGER(10)), STREQUALS(
    SYMBOLATP(), STRING(1)))

2.3 Air Travel Information Systems (ATIS)

ATIS is a standard benchmark for querying air travel
information, consisting of English queries and an associated
database containing flight information. It has long been used
as a standard benchmark in both the natural language
processing and speech processing communities. Table 3 shows
some sample queries from the ATIS suite. For ATIS, we designed
a DSL that is based around SQL style row/column operations
and provides support for predicates/expressions that
correspond to important concepts in air-travel queries:
arrival/departure locations, times, dates, prices, etc.

Example 5. The first query in the ATIS examples, Table 3,
is translated into our DSL as:

ColSet(AtomicColSet(DEP_TIME()), ROW_MIN(
    DEP_TIME(), AtomicRowPredSet(AtomicRowPred(
    EQ_DEP(CITY(Philadelphia)), Unit_TimeSet(
    TIME(morning)), EQ_ARR(CITY(Washington)),
    EQ_AIRLINES(AIRLINES(american))))))


3. Problem Definition

We study the problem of synthesizing NL-to-DSL
synthesizers, given a DSL definition and example training data.
A DSL $L = (G, VC)$ consists of a context free grammar $G$
(with terminal symbols denoted by $G_T$ and production rules
denoted by $G_R$), and a syntactic/semantic checker $VC$ that
can check whether or not a given program belongs to the
grammar $G$ and is semantically meaningful. The training
data consists of a set of pairs $(S, P)$, where $S$ is an
English sentence and $P$ is the corresponding intended program
from the DSL $L$. A sentence is simply a sequence of words
$[w_1, w_2, \ldots, w_n]$. The goal of the generated NL-to-DSL
synthesizer is to translate an English sentence to a ranked
set of programs, $[P_1, P_2, \ldots, P_k]$, in $L$.


4. NL to DSL Synthesis Algorithm 

Our synthesis algorithm (Algorithm 1) takes a natural
language command from the user and creates a ranked list of
candidate DSL programs. The first step (loop on line 2) is to
convert each of the words in the user input into one or more
terminals (function names or values) using the NL to program
terminal dictionary NLDict. This loop ranges over the
length of the input sentence and for each index looks up the
set of terminals in the DSL that are associated with the word
at that index in NLDict. Fundamentally, NLDict encodes, for
each terminal, which English language words are likely to
indicate the presence of that terminal in the desired result
program. This map can be constructed in a semi-automated
manner (§5.3). Once this association has been made for a
terminal $t$, we store a tuple of the terminal and a singleton
map, relating the index of the word to the terminal that was
produced, into the set $R_0$ (line 4).

For each natural language word, the dictionary NLDict
associates a set of terminals with it. The terminals may
be constant values or function applications with holes (□)
as arguments. Thus, Algorithm 1 (when applying NLDict on
line 3) can create incomplete programs, where some arguments
to functions are missing. For example, consider the sentence
“Print all lines that do not contain 834”. Since the grammar
contains PrintCmd := PRINT(SelectStr, IterScope) as a
production and the dictionary relates the word “print” to the
function PRINT, the partial program PRINT(□, □) will be
generated. These holes are later replaced by other programs
that match the argument types SelectStr and IterScope.
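The dictionary lookup described above can be sketched in Python. This is only an illustrative sketch: the dictionary fragment, the arities attached to each terminal, and the `Program` type are assumptions made for the example, and constant values such as 834 are not handled here.

```python
from dataclasses import dataclass

HOLE = "□"  # symbolic placeholder for a missing argument


@dataclass(frozen=True)
class Program:
    head: str          # DSL terminal (function name or constant)
    args: tuple = ()   # sub-programs or HOLE placeholders


# Hypothetical dictionary fragment: word -> list of (terminal, arity).
# The real NLDict is learned during training; these entries are made up.
NL_DICT = {
    "print": [("PRINT", 2)],
    "lines": [("LINESCOPE", 0), ("LINETOK", 0)],
    "not": [("NOT", 1)],
    "contain": [("CONTAINS", 1)],
}


def initial_tuples(sentence):
    """Build the initial set: one (program, {word index: terminal}) per hit."""
    r0 = []
    for i, word in enumerate(sentence.lower().split()):
        for terminal, arity in NL_DICT.get(word, []):
            # A function terminal starts out with every argument as a hole.
            prog = Program(terminal, (HOLE,) * arity)
            r0.append((prog, {i: terminal}))
    return r0


r0 = initial_tuples("Print all lines that do not contain 834")
```

Under these assumptions, the word “print” at index 0 contributes the partial program `Program("PRINT", (HOLE, HOLE))` together with the singleton map `{0: "PRINT"}`.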

Once the base set of terminals has been constructed, the
algorithm uses the Bag algorithm (Algorithm 2) to generate
the set of all consistent programs, $Res_T$, that can be
constructed from it (line 5). The final step is to rank (§4.2)
this set of programs, using a combination of scores and
weights, in the loop on line 7.

Algorithm 1: NL to DSL Synthesis Algorithm
Input: NL sentence S, Word-to-Terminal dictionary NLDict
Output: Ranked set of programs

 1  R_0 ← ∅;
 2  for i ∈ [0, S.Length − 1] do
 3      T ← NLDict(S[i]);
 4      foreach t ∈ T do R_0 ← R_0 ∪ (t, SingletonMap(i, t))
 5  Res_T ← Bag(S, R_0);
 6  Res_P ← {P | ∃M s.t. (P, M) ∈ Res_T};
 7  foreach program P ∈ Res_P do score(P) ← −∞
    foreach program (P, M) ∈ Res_T do
 8      s_cov ← CoverageScore(P, S, M) × ω_cov;
 9      s_map ← MappingScore(P, S, M) × ω_map;
10      s_str ← StructureScore(P, S, M) × ω_str;
11      score(P) ← max(score(P), s_cov + s_map + s_str);
12  return set of programs in Res_P ordered by score
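The ranking loop of Algorithm 1 (lines 6 to 12) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the three scoring functions and the weights are passed in as parameters, and the toy scorers and string-valued programs in the usage lines are placeholders invented for illustration.

```python
import math


def rank_programs(res_t, sentence, scorers, weights):
    """Rank programs given (program, witness_map) pairs.

    scorers and weights are dicts keyed by 'cov', 'map' and 'str';
    programs must be hashable. A program reachable through several
    witness maps keeps the best weighted score over those maps.
    """
    score = {}
    for prog, wmap in res_t:
        total = sum(weights[k] * scorers[k](prog, sentence, wmap)
                    for k in ("cov", "map", "str"))
        score[prog] = max(score.get(prog, -math.inf), total)
    return sorted(score, key=score.get, reverse=True)


# Toy usage: programs are plain strings and the scorers are stand-ins.
toy_scorers = {
    "cov": lambda p, s, m: len(m) / 2,  # fraction of 2 usable words used
    "map": lambda p, s, m: 1.0,
    "str": lambda p, s, m: 1.0,
}
toy_weights = {"cov": 1.0, "map": 1.0, "str": 1.0}
toy_res_t = [("P1", {0: "a", 1: "b"}), ("P1", {0: "a"}), ("P2", {0: "a"})]
ranked = rank_programs(toy_res_t, ["w0", "w1"], toy_scorers, toy_weights)
```

Here `P1` is ranked first because its best witness map uses both words.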


Algorithm 2: Bag

Input: NL sentence S, Initial Tuple Set B_0

 1  result ← B_0;
 2  repeat
 3      oldResult ← result;
 4      foreach (P_1, M_1), (P_2, M_2) ∈ result do
 5          ok_pc ← P_1 is partial ∧ P_2 is complete;
 6          disjoint ← UsedWords(S, M_1) ∩ UsedWords(S, M_2) = ∅;
 7          if ok_pc ∧ disjoint then
 8              combs ← SubAll(P_1, P_2) \ {⊥};
 9              new ← {(P_r, M_1 ∪ M_2) | P_r ∈ combs};
10              result ← result ∪ new;
11  until oldResult = result;
12  return result;


4.1 Synthesizing Consistent Programs 

A program $P$ in the DSL is either an atomic value (i.e., a
terminal in $G$), or a function/operator applied to a list of
arguments. By convention we represent function application
as s-expressions, where a function $F$ applied to $k$ arguments
is written $(F, P_1, \ldots, P_k)$.


Consistent Programs and Witness Maps. Given a DSL
$L = (G, VC)$, we say a program $P$ in language $L$ is
consistent with a sentence $S$ if there exists a map $M$ that
maps (some) word occurrences in $S$ to terminals in $G_T$ such
that the range of $M$ equals the set of terminals in the
program $P$.* We refer to such a map $M$ as a witness map, and
use the notation $WitnessMaps(P, S)$ to denote the set of all
such maps.

Usable and Used Words. Let $S$ be an English sentence,
$P$ be a program consistent with $S$, and $M$ be any witness
map. $UsableWords(S)$ are those word occurrences in $S$ that
are mapped to some grammar terminal and hence might be
useful in translation. $UsedWords(S, M)$ is the set of usable
word occurrences in $S$ that are used as part of the map $M$.

\[ \mathit{UsableWords}(S) = \{\, i \mid S[i] \in \mathit{Domain}(\mathit{NLDict}) \,\} \]
\[ \mathit{UsedWords}(S, M) = \mathit{UsableWords}(S) \cap \mathit{Domain}(M) \]

Partial Programs. A partial program extends the notion 
of a program to also allow for a hole (□) as an argument. 
A hole is a symbolic placeholder where another complete 
program (program without any hole) can be placed to form 
a larger program. To avoid verbosity, we often refer to a 
partial program as simply a program. 


* Since the same English word can occur at different positions in S, having 
different meanings, any map M must take the position information also as 
an argument. For simplicity of exposition, we ignore this in the paper. 


Given a partial program $P = (F, \ldots, \Box, \ldots)$ with a
hole □, we can substitute a complete program $P'$ to fill the
hole:

\[
P[\Box \leftarrow P'] =
\begin{cases}
(F, \ldots, P', \ldots) & \text{if } VC((F, \ldots, P', \ldots)) \\
\bot & \text{otherwise}
\end{cases}
\]

The validity check, $VC$, ensures that all synthesized
programs are well defined in terms of the DSL grammar and
type system (otherwise we return the invalid program $\bot$).

Combination. The combination operator SubAll generates
the set of all programs that can be obtained by substituting
a complete program $P'$ in some hole of a partial program
$P$. This is done by going over all the arguments of $P$ and
producing substitutions for argument positions with holes.
Given partial program $P = (F, P_1, \ldots, P_k)$ and complete
program $P'$, we have:

\[ \mathit{SubAll}(P, P') = \{\, P[\Box_i \leftarrow P'] \mid P_i = \Box_i \,\} \]
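The substitution and combination operators can be illustrated with a small Python sketch. It assumes programs are tuples of the form `(F, arg_1, ..., arg_k)`, uses `None` in place of the invalid program ⊥, and replaces the checker VC with a toy type table; the table entries are assumptions made for this example and are not the paper's checker.

```python
HOLE = "□"

# Toy stand-in for the checker VC: expected argument types per function
# and the result type of each head symbol (hypothetical entries).
ARG_TYPES = {"PRINT": ("SelectStr", "IterScope")}
RESULT_TYPE = {"PRINT": "PrintCmd", "DOCUMENT": "IterScope"}


def is_valid(prog):
    """Toy VC: every filled argument must have the type its slot expects."""
    head, *args = prog
    expected = ARG_TYPES.get(head, ())
    if len(args) != len(expected):
        return False
    return all(a == HOLE or (RESULT_TYPE.get(a[0]) == t and is_valid(a))
               for a, t in zip(args, expected))


def substitute(prog, i, filler):
    """Fill argument position i with filler; None stands in for ⊥."""
    candidate = prog[:i] + (filler,) + prog[i + 1:]
    return candidate if is_valid(candidate) else None


def sub_all(prog, filler):
    """One substitution attempt per top-level hole of prog."""
    return {substitute(prog, i, filler)
            for i in range(1, len(prog)) if prog[i] == HOLE}


combs = sub_all(("PRINT", HOLE, HOLE), ("DOCUMENT",))
```

With this toy table, placing `DOCUMENT` in the first (SelectStr) position is rejected, so `combs` contains the invalid marker plus the single well-typed result.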


Bag Algorithm. The Bag algorithm (Algorithm 2) is based
on computing the closure of a set of programs by enumerating
all possible well-typed combinations of the programs in
the set. The main loop (line 2) is a fixpoint iteration on the
result set of programs that have been constructed.

The requirement that $P_2$ is a complete program (line 5)
when applying the SubAll function ensures that the only
holes in the result programs are holes that were originally in
$P_1$. We restrict the initialization of $B_0$ to include only
complete programs and partial programs with holes at the top
level only. Using this restriction we can inductively show
that at each step all partial programs only have holes at the
top level. Thus, we can efficiently compute the fixpoint of
all possible programs in a bottom-up manner. The condition
$UsedWords(S, M_1) \cap UsedWords(S, M_2) = \emptyset$ (line 6)
ensures that the two programs do not use overlapping sets of
words from the user input. This ensures that the final program
cannot create multiple sub-programs with different meanings
from the same part of the user input. This also ensures that
the set of possible combinations has a finite bound based on
the number of words in the input. Line 8 constructs the set of
all possible substitutions of $P_2$ into holes in $P_1$
(ignoring any invalid results). For each of the possible
substitutions we add the result (and the union of the $M$ maps)
to the new program set (line 9). Since the domains of the maps
are disjoint, the union operation is well-defined.
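A minimal Python sketch of the fixpoint iteration follows. It makes several simplifying assumptions that are not from the paper: programs are tuples, witness maps are reduced to frozensets of used word indices, and a toy type table (in which `WORDTOK` is treated directly as a `SelectStr`) replaces the validity check.

```python
HOLE = "□"

# Toy type table (hypothetical): argument types and result types.
ARG_TYPES = {"PRINT": ("SelectStr", "IterScope")}
RESULT_TYPE = {"PRINT": "PrintCmd", "WORDTOK": "SelectStr",
               "DOCUMENT": "IterScope"}


def sub_all(prog, filler):
    """All well-typed fillings of one top-level hole of prog by filler."""
    return {prog[:i] + (filler,) + prog[i + 1:]
            for i in range(1, len(prog))
            if prog[i] == HOLE
            and RESULT_TYPE.get(filler[0]) == ARG_TYPES[prog[0]][i - 1]}


def bag(initial):
    """Closure of (program, used word indices) pairs under combination."""
    result = set(initial)
    while True:
        old = set(result)
        for p1, m1 in old:
            for p2, m2 in old:
                partial = HOLE in p1        # holes occur at top level only
                complete = HOLE not in p2
                disjoint = not (m1 & m2)    # no word is used twice
                if partial and complete and disjoint:
                    result |= {(p, m1 | m2) for p in sub_all(p1, p2)}
        if result == old:                   # fixpoint reached
            return result


r0 = {(("PRINT", HOLE, HOLE), frozenset({0})),
      (("WORDTOK",), frozenset({1})),
      (("DOCUMENT",), frozenset({2}))}
closure = bag(r0)
```

The loop terminates because every new tuple uses strictly more word indices than either of its parts, which mirrors the finite bound argued above.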

The Bag algorithm has a high recall, but, in practice, it
may generate spurious programs that arise as a result of
arbitrary rearrangement of the words in the English sentence.
To account for this, the correct translation is reported by
selecting the top-ranked program based on features of the
program and the parse tree of the sentence.


4.2 Ranking Consistent Programs 

We view the abstract syntax tree of the synthesized program
as consisting of two important constituents: the set of
terminals in the program, and the tree structure between those
terminals. We use these constituents to compute the following
three scores to determine the rank of a consistent program:
(i) a coverage score that reflects how many words in the
English sentence were mapped to some operation or value in
the program, (ii) a mapping score that reflects the likelihood
that a word-to-terminal mapping captures the user intent, and
(iii) a structure score that captures the naturalness of the
program tree structure and the connections between parts of
the program and the parts of the sentence that generated them.


4.2.1 Coverage Score 

For a given sentence $S$, a candidate translation $P$, and a
witness map $M$, the coverage score is defined as:

\[ \mathit{CoverageScore}(P, S, M) = \frac{|\mathit{UsedWords}(S, M)|}{|\mathit{UsableWords}(S)|} \]

$CoverageScore(P, S, M)$ denotes the fraction of available
information in $S$ that is actually used to generate $P$.
Intuitively, we want to prefer programs that make more use of
the information provided by the user input.


Example 6. Consider possible translations for an input S:

S: find the cheapest flight from Washington
to Atlanta

$P_1$: MIN_F(COL_FARE(), AtomicRowPredSet(EQ_DEP(
    CITY(Washington)), EQ_ARR(CITY(Atlanta))))
$P_2$: AtomicRowPredSet(EQ_DEP(CITY(Washington)),
    EQ_ARR(CITY(Atlanta)))

The first program $P_1$ makes use of all parts of the user
input, including the desired cheapest fare, while the second
program $P_2$ ignores this information. The coverage score
enables us to rank $P_1$ higher than $P_2$.
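The coverage computation for this example can be replayed with a short Python sketch. The dictionary entries and the two witness maps are assumptions made for illustration (the paper does not list which word maps to which terminal here), and returning 0.0 when no word is usable is a choice made for the sketch.

```python
def usable_words(sentence, nl_dict):
    """Indices of words that the dictionary maps to at least one terminal."""
    return {i for i, word in enumerate(sentence) if word in nl_dict}


def used_words(sentence, nl_dict, witness_map):
    """Usable word indices that the witness map actually uses."""
    return usable_words(sentence, nl_dict) & set(witness_map)


def coverage_score(sentence, nl_dict, witness_map):
    usable = usable_words(sentence, nl_dict)
    if not usable:  # assumption: define the score as 0 in this edge case
        return 0.0
    return len(used_words(sentence, nl_dict, witness_map)) / len(usable)


sentence = "find the cheapest flight from washington to atlanta".split()
# Hypothetical dictionary fragment and witness maps.
nl_dict = {"cheapest": ["MIN_F"], "from": ["EQ_DEP"], "to": ["EQ_ARR"],
           "washington": ["CITY"], "atlanta": ["CITY"]}
m1 = {2: "MIN_F", 4: "EQ_DEP", 5: "CITY", 6: "EQ_ARR", 7: "CITY"}
m2 = {4: "EQ_DEP", 5: "CITY", 6: "EQ_ARR", 7: "CITY"}

score_p1 = coverage_score(sentence, nl_dict, m1)
score_p2 = coverage_score(sentence, nl_dict, m2)
```

Under these assumptions the map for the first program uses all five usable words, while the map for the second program skips “cheapest” and so scores lower.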


4.2.2 Mapping Score 

For any word $w$ there may be multiple terminals (functions
or values) in the set $NLDict(w)$, each of which corresponds
to a different interpretation of $w$. We use machine learning
techniques to obtain a classifier $C_{map}$ based on the
part-of-speech (POS) tag provided for the word by the Stanford
NLP engine [24]. The $C_{map}.Predict$ function of the
classifier predicts the probability of each word-to-terminal
mapping being correct. We use predictions from $C_{map}$ to
compute the MappingScore, the likelihood that the terminals in
$P$ are correct interpretations of the corresponding words in
$S$.

\[ \mathit{MappingScore}(P, S, M) = \prod_{w \in \mathit{UsableWords}(S)} C_{map}.\mathit{Predict}(w, \mathit{POS}(w, S), M(w)) \]

A limitation of the MappingScore is that it looks
only at the mapping of a word but not its relation to other
words and how they are mapped by the translation. Thus,
interchanging a pair of terminals in a correct translation
gives us an incorrect translation which has the same score.

Example 7. Consider the input S and two possible
translations:

S: If “XYZ” is at the beginning of the line,
replace “XYZ” with “ABC”

$P_1$: REPLACE(SelectStr(STRING(XYZ), ALWAYS(),
    ALL()), BY(STRING(ABC)), IterScope(
    LINESCOPE(), STARTSWITH(STRING(XYZ)), ALL()))

$P_2$: REPLACE(SelectStr(STRING(XYZ), ALWAYS(),
    ALL()), BY(STRING(ABC)), IterScope(
    LINESCOPE(), BEFORE(STRING(XYZ)), ALL()))

Both programs use the same set of words, so they have
the same coverage score. The only difference is that the
word “beginning” is mapped to STARTSWITH (POS: Verb
Phrase) in $P_1$, and to BEFORE (POS: Prepositional Phrase)
in $P_2$. The mapping score helps in identifying $P_1$ as the
correct choice.
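The product form of the mapping score can be sketched in Python. The probability table below is made up and merely stands in for a trained classifier's Predict function; the POS tags are placeholders, and the product is taken over the words that the witness map actually maps, which is an assumption of this sketch.

```python
import math

# Made-up probabilities standing in for Predict(word, POS tag, terminal).
TOY_PROBS = {
    ("beginning", "NN", "STARTSWITH"): 0.8,
    ("beginning", "NN", "BEFORE"): 0.2,
    ("replace", "VB", "REPLACE"): 0.9,
}


def predict(word, pos, terminal):
    """Toy classifier: look up the probability, defaulting to 0."""
    return TOY_PROBS.get((word, pos, terminal), 0.0)


def mapping_score(words, pos_tags, witness_map):
    """Product of per-word probabilities over the mapped word indices."""
    return math.prod(predict(words[i], pos_tags[i], terminal)
                     for i, terminal in witness_map.items())


words = ["beginning", "replace"]
pos_tags = ["NN", "VB"]  # placeholder tags for the toy table
score_p1 = mapping_score(words, pos_tags, {0: "STARTSWITH", 1: "REPLACE"})
score_p2 = mapping_score(words, pos_tags, {0: "BEFORE", 1: "REPLACE"})
```

With these toy numbers the interpretation of “beginning” as STARTSWITH yields the higher score, matching the preference described in the example.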

4.2.3 Structure Score 

The structure score captures the notion of naturalness in the
placement of sub-programs. We use connection features
obtained from the sentence $S$, the natural language parse tree
for the sentence, $NLParse(S)$, and the corresponding
program $P$ to define the overall structure score. These
features are used to produce the classifier $C_{str}$, which
computes the probability that each of the combinations in $P$
is correct.

Definition 1 (Connection). For a production $R \in G_R$ of
the form $N \to N_1 \ldots N_i \ldots N_j \ldots N_k$, the tuple
$(R, i, j)$ where $1 \le i, j \le k$ and $i \ne j$ is called a
connection.

Definition 2 (Combination). Consider the program $P =
(P_1, P_2, \ldots, P_k)$ generated using the production
$R: N \to N_1 N_2 \ldots N_k$, such that $N_i$ generates $P_i$
for $1 \le i \le k$. We say the pair of sub-programs
$(P_i, P_j)$ is combined via connection $(R, i, j)$ and this
combination is denoted as $Conn(P_i, P_j)$.

The overall StructureScore is obtained by taking the
geometric mean of the various connection probabilities for
the program $P$; this normalizes the score to account for
programs with differing numbers of connections.

\[ \mathit{StructureScore}(P, S, M) = \mathit{GeometricMean}(\mathit{ConnProbs}(P, S, M)) \]
\[ \mathit{ConnProbs}(P, S, M) = \bigcup_{\mathit{Conn}(P_i, P_j) \text{ in } P} \{\, C_{str}[\mathit{Conn}].\mathit{Predict}(\vec{f}, 1) \,\} \]

where $\vec{f} = (f_{pos1}, f_{pos2}, f_{lca1}, f_{lca2}, f_{order}, f_{over}, f_{dist})$
is computed for $P_i$ and $P_j$ using $P$, $S$, and $M$.

We obtain a separate classifier, $C_{str}[Conn]$, for each
connection $Conn$. The function $C_{str}[Conn].Predict$ asks the
classifier to predict the probability that $\vec{f}$ belongs to
class 1 (i.e., is present in a correct translation). The other
class is 0.
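The aggregation step can be sketched in Python as a geometric mean over per-connection probabilities. The classifiers are represented by plain callables, the connection names are invented, and the treatment of an empty list (score 1.0) and of zero probabilities (score 0.0) is an assumption of this sketch rather than something the text specifies.

```python
import math


def geometric_mean(probs):
    """Geometric mean of probabilities in [0, 1]."""
    probs = list(probs)
    if not probs:
        return 1.0  # assumption: no connections means nothing to penalize
    if any(p == 0 for p in probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def structure_score(connections, classifiers):
    """connections: list of (connection id, feature vector) pairs.

    classifiers maps a connection id to a callable that returns the
    probability that the feature vector belongs to class 1.
    """
    return geometric_mean(classifiers[conn](features)
                          for conn, features in connections)


# Toy usage with constant stand-in classifiers (hypothetical names).
toy_classifiers = {"conn_a": lambda f: 0.25, "conn_b": lambda f: 1.0}
toy_connections = [("conn_a", (1, 0)), ("conn_b", (0, 1))]
toy_score = structure_score(toy_connections, toy_classifiers)
```

Because the mean is geometric, a program with three connections of probability 0.5 each scores the same as a program with a single such connection, which is the normalization effect mentioned above.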

Given a program $P$ and input sentence $S$ that are related
by a witness map $M$ and the parse tree $NLParse(S)$, the
following functions define several useful relationships:

$TreeCover(P, S, M)$ = minimal sub-tree $T_{sub}$ of $NLParse(S)$
    s.t. $UsedWords(S, M) \subseteq UsableWords(T_{sub})$

$Root(P, S, M)$ = root node of $TreeCover(P, S, M)$

$Span(P, M) = [Min(Domain(M)), Max(Domain(M))]$

In the rest of the section, we assume that $P_1$ and $P_2$
denote two sub-programs of $P$. The following features
determine the naturalness of the connections between $P$,
$P_1$, $P_2$, and $S$:

Definition 3 (Root POS Tags). Part-of-speech features
are the POS tags assigned by the NL parser to the root nodes
of the sub-trees associated with $P_1$ and $P_2$ respectively:

\[ f_{pos1} = \mathit{POS}(\mathit{Root}(P_1, S, M)) \]
\[ f_{pos2} = \mathit{POS}(\mathit{Root}(P_2, S, M)) \]

The features $f_{pos1}$ and $f_{pos2}$ help to learn the
phrases that are commonly combined using a particular
connection.

Definition 4 (LCA Distances). Let LCA be the least-common-ancestor of $\mathrm{Root}(P_1, S, M)$ and $\mathrm{Root}(P_2, S, M)$. The least-common-ancestor distance features are the tree-distances from LCA to the root nodes of the sub-trees associated with $P_1$ and $P_2$ respectively:

$$f_{lca1} = \mathrm{TreeDistance}(LCA, \mathrm{Root}(P_1, S, M))$$

$$f_{lca2} = \mathrm{TreeDistance}(LCA, \mathrm{Root}(P_2, S, M))$$

Definition 5 (Order). The order feature is determined by the positions of the roots of the sub-trees associated with $P_1$ and $P_2$ in the in-order traversal of $\mathrm{NLParse}(S)$:

$$f_{order} = \begin{cases} 1 & \text{if } \mathrm{Root}(P_1, S, M) \text{ occurs before } \mathrm{Root}(P_2, S, M) \text{ in the in-order traversal of } \mathrm{NLParse}(S) \\ -1 & \text{otherwise} \end{cases}$$


Features $f_{lca1}$, $f_{lca2}$ and $f_{order}$ are used to learn the correspondence between the parse tree structure and the program structure. We use these to keep the structure of the translation close to the structure of the parse tree.

Definition 6 (Overlap). The overlap feature captures the possibility that two programs are constructed from mixtures of two subtrees in the NL parse tree:

$$f_{over} = \begin{cases} 1 & \text{if } \mathrm{Span}(P_1, M_1).end < \mathrm{Span}(P_2, M_2).start \\ -1 & \text{if } \mathrm{Span}(P_1, M_1).start > \mathrm{Span}(P_2, M_2).end \\ 0 & \text{otherwise} \end{cases}$$

Definition 7 (Distance). Given two programs $P_1$ and $P_2$ we define the distance feature for programs by looking at the distance between the word spans used in the programs:

$$f_{dist} = \begin{cases} \mathrm{Span}(P_2, M_2).start - \mathrm{Span}(P_1, M_1).end & \text{if } f_{over} = 1 \\ \mathrm{Span}(P_1, M_1).start - \mathrm{Span}(P_2, M_2).end & \text{if } f_{over} = -1 \\ 0 & \text{otherwise} \end{cases}$$

The features $f_{over}$ and $f_{dist}$ capture the proximity information of words and are useful because related words often occur together in the input sentence.
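
The overlap and distance features of Definitions 6 and 7 can be sketched directly from word spans. The representation below is an assumption for illustration: a span is an inclusive `(start, end)` pair of word positions, and `span_of` plays the role of Span by taking the minimum and maximum of the witness map's domain.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (start, end) word positions, inclusive; hypothetical encoding


def span_of(word_positions: Iterable[int]) -> Span:
    """Span = [min, max] over the word positions in a witness map's domain."""
    positions = list(word_positions)
    return (min(positions), max(positions))


def overlap_feature(span1: Span, span2: Span) -> int:
    """f_over: 1 if span1 ends before span2 starts, -1 if it starts after span2 ends, else 0."""
    if span1[1] < span2[0]:
        return 1
    if span1[0] > span2[1]:
        return -1
    return 0


def distance_feature(span1: Span, span2: Span) -> int:
    """f_dist: gap between the two spans, or 0 when the spans overlap."""
    over = overlap_feature(span1, span2)
    if over == 1:
        return span2[0] - span1[1]
    if over == -1:
        return span1[0] - span2[1]
    return 0
```

For instance, numbering the words of "Print all lines that do not contain 834" from 0, a sub-program built from "lines" has span (2, 2) and one built from "not contain 834" has span (5, 7); the first precedes the second, so the overlap feature is 1 and the distance is 3.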

Example 8. Consider possible translations for an input S:

S: Print all lines that do not contain “834”

$P_1$: PRINT(SelectStr(LINETOK, NOT(CONTAINS(STRING(834))), ALL()), DOCUMENT())

$P_2$: PRINT(SelectStr(STRING(834), NOT(CONTAINS(LINETOK)), ALL()), DOCUMENT())

In the parse tree NLParse(S), “print” will have two arguments, what to print (“lines”) and when to print (“not contain 834”). We observe the following for the candidate programs: (a) The word “lines” is closer to “print”, while the word “834” is farther in NLParse(S). The same structure is observed for $P_1$, but not for $P_2$. This is captured by LCA Distances. (b) The order of the words in NLParse(S) matches the order in $P_1$ better than the order in $P_2$. This is captured by the Order feature. (c) The phrase “do not contain 834” is kept intact in $P_1$, but is split apart in $P_2$. Overlap and Distance features will capture this splitting and reordering.

Both programs use the same set of words and the same word-to-terminal mappings, resulting in the same coverage score and the same mapping scores. However, the program $P_1$ is correct and our choice of features ranks it higher.

4.3 Combined Score Example 

To provide some intuition into the complementary strengths and weaknesses of the various scores, we examine how they behave on a subset of the programs generated by the Bag algorithm for the following text editing task: Add a “*” at the beginning of the line in which the string “P.O. BOX” occurs. Table 4 shows some of the consistent programs generated by the Bag algorithm.




2015 / 9/2 



| | Program Generated | Coverage Score | Mapping Score | Structure Score | Final Score |
|---|---|---|---|---|---|
| $P_1$ | INSERT(STRING(*), START, IterScope(LINESCOPE, CONTAINS(STRING(P.O. BOX)), ALL)) | 8.33 | 5.73 | 4.45 | 322.17 |
| $P_2$ | INSERT(STRING(*), START, IterScope(LINESCOPE, ALWAYS, ALL)) | 5.00 | 8.40 | 4.45 | 248.17 |
| $P_3$ | INSERT(STRING(*), START, IterScope(LINESCOPE, STARTSWITH(STRING(P.O. BOX)), ALL)) | 6.67 | 6.35 | 1.09 | 232.57 |
| $P_4$ | INSERT(STRING(P.O. BOX), START, IterScope(LINESCOPE, CONTAINS(STRING(*)), ALL)) | 8.33 | 5.73 | 1.00 | 272.43 |
| $P_5$ | INSERT(STRING(*), START, DOCUMENT) | 3.33 | 4.74 | 6.84 | 216.33 |

Table 4: Ranking the set of consistent programs generated by the Bag algorithm.


The first program ($P_1$) is the intended translation. Let us look at the performance of each of the component scores.

Coverage Score: Both $P_1$ and $P_4$ use the maximum number of words from the sentence, and are tied for the top score. $P_4$ is a wrong program as it attempts to add “P.O. BOX” at the beginning of the line containing “*”.

Mapping Score: The classifier learnt by our system maps the word “beginning” to the terminal STARTSWITH with a high probability but to the terminal START with a lower probability. Further, it maps “occurs” to the terminal CONTAINS with a still lower probability. $P_2$ does not use the word “occurs”; otherwise it has the same mappings as $P_1$. As a result it has a higher mapping score than $P_1$, but suffers on coverage. $P_3$ maps “beginning” to STARTSWITH, and does not use the word “occurs”. As a result it has a mapping score lower than $P_2$ but higher than $P_1$. If we had used the mapping score alone, we would not have been able to rank the desired program $P_1$ above the incorrect programs $P_3$ and $P_4$.

Structure Score: Coverage score and mapping score look only at the mapping of a word but not its relation to other mappings and their placement with respect to the original sentence. Structure score fixes this by considering structural information (parse tree, ordering of words and distance among words) from the sentence. $P_4$ has a poor structure score because it swaps the sentence ordering for the strings “*” and “P.O. BOX”. $P_3$ also suffers as it moves “beginning” (mapped to STARTSWITH) away from “Add” (mapped to INSERT). $P_1$ gets a high structure score as it maintains the parse tree structure of the input text. Note that $P_2$ and $P_5$ have high structure scores as well. This is because structure score does not take into account the fraction of used words or word-to-terminal mappings. So, an incomplete translation that uses very few words but maps them to correct terminals and places them correctly is likely to have a high value.

The desired program, $P_1$, is top ranked by only one of the scores, and even in that case it is tied with an incorrect result. However, a combination of the scores with appropriate weights (§5) ranks $P_1$ as the clear winner!
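
The observations above can be checked mechanically against the numbers in Table 4. The following sketch copies the table's values and reports which programs are top ranked under each column; the dictionary layout and the helper name `top_ranked` are illustrative choices, and the "final" column is taken from the table as given rather than recomputed from any weights.

```python
# Scores copied from Table 4: (coverage, mapping, structure, final).
TABLE_4 = {
    "P1": (8.33, 5.73, 4.45, 322.17),
    "P2": (5.00, 8.40, 4.45, 248.17),
    "P3": (6.67, 6.35, 1.09, 232.57),
    "P4": (8.33, 5.73, 1.00, 272.43),
    "P5": (3.33, 4.74, 6.84, 216.33),
}
COLUMNS = ("coverage", "mapping", "structure", "final")


def top_ranked(column: str) -> list:
    """Names of all programs sharing the highest value in the given column."""
    idx = COLUMNS.index(column)
    best = max(scores[idx] for scores in TABLE_4.values())
    return sorted(name for name, scores in TABLE_4.items() if scores[idx] == best)


for col in COLUMNS:
    print(col, top_ranked(col))
```

Only the coverage column places P1 at the top, and there it is tied with P4; the mapping and structure columns each prefer a different incorrect program, while the final column singles out P1.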

5. Training Phase 

This section describes the learning of classifiers, weights, and the word-to-terminal mapping used by the synthesis algorithm described in §4. The key aspects in this process are (i) deciding which machine learning algorithm to use, and (ii) generation of (lower level) training data for that machine learning algorithm from the top level training data provided by the DSL designer.

Algorithm 3: Learning Mapping Score Classifier $C_{map}$

Input: Training Data $T$

foreach training pair $(S, P) \in T$ do
    $\mathcal{M} \leftarrow \mathrm{WitnessMaps}(P, S)$;
    $M \leftarrow \arg\max_{M' \in \mathcal{M}} (\mathrm{Likeability}(P, S, M'))$;
    foreach $(w, t) \in M$ do
        $C_{map}.\mathrm{Train}(w, \mathrm{POS}(w, S), t)$
return $C_{map}$;

5.1 Mapping Score Classifier (Cmap) 

The goal of the $C_{map}$ classifier is to predict the likelihood of a word $w$ mapping to a terminal $t \in G_T$ using the POS tag of the word $w$. The learning of this classifier is performed using an off-the-shelf implementation of a Naive Bayesian Classifier [6]. The training data for this classifier is generated as shown in Algorithm 3.

The key idea is to first construct the set $\mathcal{M}$ of all witness maps that can yield program $P$ from natural language input $S$. We then select the most likely map $M$ out of these witness maps based on the partial lexicographic order given by the likeability score tuples.

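$$\mathrm{Likeability}(P, S, M) = (\mathrm{UsedWords}(S, M),\ \mathrm{Disjointedness}(P, S, M))$$

$$\mathrm{Disjointedness}(P, S, M) = \sum_{P' \in \mathrm{SubProgs}(P)} \sigma(P')$$

where $\sigma((P_1, \ldots, P_n)) = 1$ if $\forall P_i, P_j,\ P_i \cap P_j = \emptyset$, and 0 otherwise.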

The likeability tuples serve two purposes. First, via UsedWords, they guide the system to prefer mappings that use all parts of the input sentence. Second, via Disjointedness, they guide the system to prefer mappings that avoid using a single part of a sentence to construct multiple different subprograms.
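
A rough sketch of selecting a witness map by likeability follows. Several things here are assumptions rather than details from the text: a witness map is modelled as a dictionary from word position to terminal, UsedWords is compared by the number of words used, each sub-program is summarized by the sets of word positions used by its components, and Python's total lexicographic tuple ordering stands in for the partial order mentioned above.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

WitnessMap = Dict[int, str]      # word position -> DSL terminal (hypothetical encoding)
ChildWordSets = List[Set[int]]   # words used by each component of one sub-program


def sigma(children: ChildWordSets) -> int:
    """1 if the components of a sub-program use pairwise disjoint words, else 0."""
    return int(all(a.isdisjoint(b) for a, b in combinations(children, 2)))


def likeability(witness_map: WitnessMap, subprograms: List[ChildWordSets]) -> Tuple[int, int]:
    """(number of used words, disjointedness), compared lexicographically."""
    used_words = len(witness_map)
    disjointedness = sum(sigma(children) for children in subprograms)
    return (used_words, disjointedness)


def most_likely_map(candidates: List[Tuple[WitnessMap, List[ChildWordSets]]]) -> WitnessMap:
    """Pick the witness map whose likeability tuple is largest."""
    return max(candidates, key=lambda c: likeability(c[0], c[1]))[0]
```

With this ordering, the number of used words decides first, and disjointedness only breaks ties between maps that use equally many words.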

5.2 Structure Score Classifiers ($C_{str}$)

In this section, we describe how the classifiers used in the structure score, $C_{str}[\mathrm{Conn}]$ for each connection Conn, are learned. The goal of each classifier $C_{str}[\mathrm{Conn}]$ is to predict the likelihood that a combination $c$ is an instance of connection Conn using the 7 features of $c$ from §4.2. We use an off-the-shelf implementation of a Naive Bayesian Classifier and generate the training data for it as shown in Algorithm 4.

Algorithm 4: Learning Structure Score Classifiers $C_{str}$

Input: Training Data $T$

foreach training pair $(S, P) \in T$ do
    AllOpts = SynthNoScore($S$);
    foreach program $P' \in$ AllOpts do
        foreach combination $c$ that occurs in $P'$ do
            if $c$ occurs in $P$ then class $\leftarrow 1$ else class $\leftarrow 0$;
            Conn $\leftarrow$ connection used by $c$;
            $f \leftarrow (f_{pos1}, f_{pos2}, f_{lca1}, f_{lca2}, f_{order}, f_{over}, f_{dist})$;
            $C_{str}[\mathrm{Conn}].\mathrm{Train}(f, \text{class})$;
return $C_{str}$

The key idea is to run the synthesis algorithm without the scoring step, SynthNoScore, to construct the set of all programs, AllOpts, that can be constructed from the English sentence $S$. Any combination present in a program $P'$ in AllOpts but not present in $P$ is used as a negative example, while any combination present in $P$ is used as a positive example.
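
The labeling step of Algorithm 4 can be sketched as below for a single training pair. The encoding is hypothetical: a combination is a `(connection, features)` pair, the desired program is given as the set of its combinations, and each candidate program from AllOpts is given as a list of combinations. The classifier itself is left out; the function only produces the per-connection labeled examples that would be fed to it.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

Features = Tuple[float, ...]
Combination = Tuple[Hashable, Features]  # (connection, feature vector); hypothetical encoding


def structure_training_examples(
    desired_combos: Set[Combination],
    candidate_programs: List[List[Combination]],
) -> Dict[Hashable, List[Tuple[Features, int]]]:
    """Label every combination in every candidate program.

    A combination gets class 1 if it also occurs in the desired program,
    class 0 otherwise; examples are grouped by connection.
    """
    examples = defaultdict(list)
    for combos in candidate_programs:
        for conn, features in combos:
            label = 1 if (conn, features) in desired_combos else 0
            examples[conn].append((features, label))
    return dict(examples)
```

Grouping by connection mirrors the fact that a separate classifier is trained for each connection.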

5.3 Dictionary Construction 

We construct the dictionary NLDict in a semi-automated manner using the names of the terminals (functions and arguments) in the DSL. If the name of an operation is a proper English word, such as INSERT, we use the WordNet [35] synonym list to gather commonly used words which are associated with the action. Cases where the name is not a simple word but instead a concatenation of (or abbreviation of) several words, such as STARTSWITH, are handled by splitting the name and resolving the synonyms of each sub-component word.

It is possible that the general purpose synonym sets provided by WordNet contain English words that are not useful for the particular domain we are constructing the translator for. However, the mapping score learning in §5.1 will simply assign these words low scores. Once the learning algorithm for the mappings has finished assigning weights to each word/terminal we discard all mappings below a certain threshold. Conversely, it is also possible that an important domain specific synonym will not be provided by the WordNet sets or that the names in the DSL are not well matched with proper English words. Our system automatically detects these cases as a result of being unable to find witness maps for programs (in the training data) involving certain DSL terminals. In these cases, it prompts the user to identify a word in an input sentence that corresponds to an unmapped terminal in a program. These new seed words are then further used to build a more extensive synonym set using WordNet.


5.4 Learning Combination Weights 

In the previous section, we defined 3 component scores for a translation. A standard mechanism for combining multiple scores into a single final score is to use a weighted sum of the component scores. In this section we describe a novel method for learning the weights required to maximize the following function.

Optimization Function: Number of benchmarks in the training set for which the correct translation is assigned rank 1.


In numerical optimization, maximization of an optimization function is a standard problem which can be solved using stochastic gradient descent [5]. In order to use gradient descent to find the weight values that maximize our optimization function we need to define a continuous and differentiable loss function, $F_{loss}$. This loss function is used to guide the iterative search for a set of weights that maximizes the value of the optimization function as follows:

$$w_{n+1} = w_n - \gamma \nabla F_{loss}(w_n), \quad n = 0, 1, 2, \ldots$$

where $\nabla$ denotes the gradient and $\gamma$ is a positive constant. At each step, $w$ moves in the direction in which the value of $F_{loss}$ decreases and the process is stopped when the change in the function value in successive steps drops below a specified threshold $\epsilon$.
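
The update rule and stopping criterion can be sketched as a short loop. This is a plain (full-gradient) descent written for illustration, not the stochastic variant cited above, and the function name, default step size, and `max_steps` guard are assumptions of the sketch.

```python
from typing import Callable, List


def gradient_descent(
    loss: Callable[[List[float]], float],
    grad: Callable[[List[float]], List[float]],
    w0: List[float],
    gamma: float = 0.1,    # positive step size
    eps: float = 1e-9,     # stop when the loss changes by less than this
    max_steps: int = 10000,
) -> List[float]:
    """Iterate w <- w - gamma * grad(w) until the loss change drops below eps."""
    w = list(w0)
    prev = loss(w)
    for _ in range(max_steps):
        g = grad(w)
        w = [wi - gamma * gi for wi, gi in zip(w, g)]
        cur = loss(w)
        if abs(prev - cur) < eps:
            break
        prev = cur
    return w


# Usage on a toy quadratic whose minimum is at (3, -1).
toy_loss = lambda w: (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2
toy_grad = lambda w: [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]
w_star = gradient_descent(toy_loss, toy_grad, [0.0, 0.0])
```

In the setting of this section, `loss` would be the loss function over the three component-score weights rather than the toy quadratic used here.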

A common form for loss functions is a sigmoid. We can convert our ill-behaved optimization function into a loss function that is closer to what is needed to perform gradient descent by basing the sigmoid on the ratio of the score given to the best incorrect result to the score given to the desired result, via the following construction:

$$F_{loss}(w) = \sum_{\forall\, \text{training } S} f(w, S)$$

$$f(w, S) = \frac{1}{1 + e^{-cx}} \quad \text{where } x = \frac{v_{wrong}}{\mathrm{Score}(P_{desired})} - 1 \;\wedge\; c > 0$$

$$P_{desired} = \text{correct translation of } S$$

$$v_{wrong} = \max(\{\mathrm{Score}(P) \mid P \in \mathrm{Bag}(S) \wedge P \ne P_{desired}\})$$


Although the above transformation results in a loss function which is mostly well behaved (it saturates appropriately and is piecewise continuous and differentiable), there are still points where the function is not continuous. In particular the presence of the max function in the definition of $v_{wrong}$ creates discontinuous points in $F_{loss}$. However, the following insight enables us to replace the discontinuous max operation with a continuous approximation:

$$\max(a, b) \approx \frac{\log(e^{ca} + e^{cb})}{c} \quad \text{where } c > 1 \text{ if } a \ll b \vee b \ll a$$


Thus, we can replace the max operator with this function, extended in the natural way to $k$ arguments, in the computation of $v_{wrong}$ to produce a globally continuous and differentiable loss function. The cases where there are several incorrect results which are given very similar scores are minimized by the selection of a large value for $c$, which amplifies small differences. Additionally, in the worst case where two scores are extremely close, the impact of the approximation is to drive the gradient descent to increase the ratio between $v_{wrong}$ and $\mathrm{Score}(P_{desired})$. Thus, the correctness of the gradient descent algorithm is not impacted.
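
The log-exponential approximation of max, extended to $k$ arguments, can be sketched as follows. The function name and the default value of `c` are illustrative; subtracting the largest value before exponentiating is a standard numerical-stability step added here and does not change the result.

```python
import math
from typing import Sequence


def smooth_max(values: Sequence[float], c: float = 10.0) -> float:
    """Smooth approximation of max: log(sum(exp(c * v))) / c.

    Factoring out the largest value keeps exp() from overflowing:
    log(sum(exp(c*v))) = c*m + log(sum(exp(c*(v - m)))) with m = max(values).
    """
    m = max(values)
    return m + math.log(sum(math.exp(c * (v - m)) for v in values)) / c
```

The approximation always lies between the true maximum and the maximum plus $\log(k)/c$, so a larger `c` tightens it; the largest error occurs when the scores are tied.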

In addition to satisfying the basic requirements for performing gradient descent, our loss function, $F_{loss}$, saturates for large values of $x$. This implies that if an input $S$ has $\frac{\mathrm{Score}(P_{desired})}{v_{wrong}} \ll 1$, it will not dominate the gradient descent, causing it to improve the ranking results for a single benchmark at the expense of rank quality on a large number of other benchmarks. The saturation also implies that the descent will not become stuck trying to find weights for an input where there is no assignment to the weights that will improve the ranking, i.e., there is an incorrect result program $P_i$ where every component score has a higher value than for the desired program $P_{desired}$.

6. Experimental Evaluation 

The (online) synthesis algorithm, consisting of the Bag algorithm and feature extraction (for ranking), was implemented in C# and used the Stanford NLP Engine (Version 2.0.2) [48] with its default configuration for POS tagging and extracting other NL features. The offline gradient descent was implemented in C# while the classifiers used for training the component features were built using MATLAB.

A major goal of this research is the production of a generic framework for synthesizing programs in a given DSL from English sentences. Thus, we selected three different categories of tasks: question answering (Air Travel Information System), constraint based model construction (Automata Theory Tutoring), and command execution on unstructured data (Repetitive Text Editing). These domains, described in detail in §2, present a variety of structure in the underlying DSL, the language idioms that are used, and the complexity of the English sentences that are seen. For benchmarks, automata descriptions are taken verbatim from textbooks and online assignments. Text editing descriptions are taken verbatim from help forums and user studies. ATIS descriptions are part of a standard suite. Tables 1(b), 1(c), 2 and 3 describe a sample of these benchmarks.

Air Travel Information System (ATIS). We selected 535 queries at random from the full ATIS suite (which consists of a few thousand queries) and, by hand, constructed the corresponding program in our DSL to realize the query. Each task in the ATIS domain is a query over flight related information.

Automata Theory Tutoring. We collected 245 natural language specifications (accepting conditions) of finite state automata from books and online courses on automata theory.

Repetitive Text Editing. We collected descriptions of 21 text editing tasks from Excel books and help forums. We collected 265 English descriptions for these 21 tasks via a user study, which involved 25 participants (who were first and second year undergraduate students). The large number of participants ensured variety in the English descriptions (e.g., see Table 1(c)). In order to remove any description bias, each of these tasks was described not using English but using representative pairs of input and output examples. Additionally, we obtained 227 English descriptions for 227 text editing tasks (one for each task) from an independent corpus [34].

Figure 1: Ranking precision of algorithm on all domains.

6.1 Precision, Recall, and Computational Cost 

In this study we used standard 10-fold cross-validation to evaluate the precision and recall of the translators on each of the domains. Thus, we select 90% of the data at random to use for learning the classifiers/weights and then evaluate the system on the remaining 10% of data which was held back (and not seen during training). In the ranking we handle ties in the scores assigned to an element using a 1334 ranking scheme [43]. In 1334 ranking, in the case of tied scores, each element in the tied group is assigned a rank corresponding to the lowest position in the ordered result list (as opposed to the highest). This ensures that the reported results represent the worst case number of items that may appear in a ranked list before the desired program is found.
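
The 1334 tie-handling rule described above can be sketched in a few lines. The function name is illustrative, and the sketch assumes that higher scores are better.

```python
from typing import List, Sequence


def ranks_1334(scores: Sequence[float]) -> List[int]:
    """Rank items (1 = best); tied items all get the lowest position of their group.

    The rank of an item is the number of items whose score is at least as high,
    so a two-way tie for second place yields ranks 3 and 3.
    """
    return [sum(1 for other in scores if other >= s) for s in scores]


print(ranks_1334([9.0, 7.5, 7.5, 3.0]))  # [1, 3, 3, 4]
```

Under this scheme a desired program that ties with an incorrect one for the best score receives rank 2, so it is not counted as a top-1 result.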

Precision. Fig. 1 shows the percentage of inputs for which the desired program in the DSL is the top ranked result, the percentage of inputs where the desired result is in the first three results, and the percentage where the desired result may be more than three entries down in the result list. As shown in the figure, for every domain, on over 80% of the inputs the desired program is unambiguously identified as the top ranked result. Further, for the ATIS domain the desired result is the top ranked result for 88.4% of the natural language inputs. Given the size of our sample from the full ATIS suite we can infer that the desired program will be the top ranked result for 88.4 ± 4.2% of the natural language inputs at a 95% confidence level. These results show that our novel program synthesis based translation approach is competitive with the state-of-the-art natural language processing systems: 85% in Zettlemoyer and Collins [51], 84% in Poon [39], and 83% in Kwiatkowski, Zettlemoyer, Goldwater, and Steedman [25].

Figure 2: Timing performance of algorithm on all domains.
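
The text does not say how the ±4.2% margin was derived. One calculation that reproduces it, offered here only as a plausible reading, is the conservative normal-approximation margin of error for a proportion, which uses the worst-case value $p = 0.5$ for a sample of 535 queries at the 95% level ($z = 1.96$).

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a proportion.

    p = 0.5 gives the conservative (largest) margin for a sample of size n.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


conservative = margin_of_error(535)          # worst-case p = 0.5
plug_in = margin_of_error(535, p=0.884)      # using the observed proportion
print(round(100 * conservative, 1), round(100 * plug_in, 1))  # 4.2 2.7
```

Using the observed proportion instead of the worst case would give a narrower margin of about 2.7%, so the reported figure is on the cautious side under this reading.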

Recall. In addition to consistently producing the desired 
program as the top ranked result for most inputs the ranking 
algorithm places the desired program in the top 3 results an 
additional 5%-12% of the time. Thus, across all three of the 
domains, for over 90% of the natural language inputs the 
desired program is one of the three top ranked results. This 
leaves less than 10% of the inputs for any of the domains, 
and only 5% in the case of the text editing domain, where 
the synthesizer was unable to produce and place the desired 
program in the top three spots. 

Computational Cost. Fig. 2 shows the distribution of the 
time required to run the synthesis algorithm and perform 
the ranking. On average translation takes 0.68 seconds for 
Text Editing, 1.72 seconds for Automata and 1.38 seconds 
for the ATIS inputs. Further, the distribution of times is 
heavily skewed with more than 85% of the inputs taking 
under 1 second and very few taking more than 3 seconds. 
The outliers tend to be inputs in which the user has specified 
an action in an exceptionally redundant manner. 

6.2 Individual Component Evaluation 

In §4 we defined various components for ranking and provided intuition into their usefulness. To validate that these component scores are, in fact, important to achieving good results we evaluated our choices by using various subsets of the component scores, learning the best weights for the subset, and re-ranking the programs.

Performance of Individual Scores. The results of using each component in isolation are presented in Table 5. This table shows that when identifying the top-ranked program the best performance for using only CoverageScore is 17.9%, using MappingScore is 31.0% and for StructureScore is 51.4%. This is far worse than the result obtained by using the combined ranking, which placed the desired program as the top result for 84.9% of the inputs. Thus, we conclude that the components are not individually sufficient.


1334 Top Rank:

| Domain | CoverageScore | MappingScore | StructureScore |
|---|---|---|---|
| ATIS | 5.6% | 1.9% | 20.2% |
| Automata | 17.9% | 31.0% | 51.4% |
| Text Editing | 8.1% | 8.9% | 34.4% |

Table 5: Performance of individual component scores in ranking the desired program as top result.


% change on dropping:

| Domain | CoverageScore | MappingScore | StructureScore |
|---|---|---|---|
| ATIS | -30.6% | -4.7% | -81.9% |
| Automata | -22.4% | -2.0% | -47.7% |
| Text Editing | -19.7% | -4.7% | -65.8% |

Table 6: Impact of dropping individual component scores on top rank percentage.


Score Independence. Although these results show that independently none of the components are sufficient for the program ranking, it may be the case that one of the components is, effectively, a combination of the other two. Table 6 shows the results of ranking the programs when dropping one of the components. Dropping StructureScore results in the largest decrease, as high as 81.86% in the worst case, and even the best case has a decrease of 47.75%. Dropping CoverageScore also results in substantial degradation, although not as high as for StructureScore. The impact of dropping MappingScore is much smaller, between 2.04% and 4.67%. However, the consistent positive contribution of MappingScore shows that it still provides useful information for the ranking. Thus, all of the components provide distinct and useful information.

Dictionary Construction. In practice the semi-automated approach makes dictionary construction a task that, while usually requiring manual assistance, does not require expertise in natural language processing or program synthesis. On average the dictionaries for each domain contained 144 English words averaging 4.51 words/terminal and 1.48 terminals/word. The user was prompted to provide 20.7% of the mappings on average across the three DSLs. Although beyond the scope of this work, as it requires a larger corpus of training data, the amount of user intervention can be further reduced by using statistical alignment to automatically extract the domain specific synonyms from the training data.

Score Combination Weights. We used gradient descent to learn how much to weight each score in the computation of the final rank of a program. To evaluate the quality of the weights identified via the gradient descent we compared them with a naive selection of equal weights for all the component scores and with the results of boosting. Boosting [8] is a frequently used technique which combines a set of weaker rankings, such as the individual component scores, to produce a single strong ranking. Table 7 shows the results of the rankings obtained with the three approaches.

| Domain | Total Count | Equal Wt. Top | Equal Wt. Top-3 | RankBoost Top | RankBoost Top-3 | Gradient Top | Gradient Top-3 |
|---|---|---|---|---|---|---|---|
| ATIS | 535 | 73.2% | 90.8% | 79.8% | 89.9% | 88.4% | 93.3% |
| Automata | 245 | 74.2% | 91.4% | 73.1% | 93.5% | 84.9% | 91.2% |
| Text Editing | 492 | 74.0% | 91.1% | 74.4% | 91.8% | 82.3% | 94.9% |

Table 7: Comparison of ranking using equal weights, gradient descent, and RankBoost. Column “Top” (“Top-3”) shows the percentage of benchmarks where the correct translation is ranked 1 (ranked in top 3).

The results show that using gradient descent has improved the number of top ranked benchmarks significantly over the naive weight selection (by as much as 15%). However, the improvement in the top 3 ranked benchmarks is much smaller. Similarly, the gradient descent approach produces substantially better results than RankBoost with an average difference of 9% in the top ranked benchmarks. Thus, we can conclude that the use of gradient descent for learning the combination weights is an important factor in the overall quality of the results.

Our choice of the ranking functions is critical to the quality of results. As shown in Table 6, dropping any of the component functions results in a substantial loss of precision. Also, using a simpler method, such as equal weights or boosting [8], to compute the combination weights results in a loss of 9-15% in precision when compared to the use of gradient descent (Table 7).

In our system, most failures (i.e., the correct solution failing to rank in the top three solutions) arise because some key information is left implicit in the English description, e.g., “I want to fly to Chicago on August 15”. In this case, the departing city should default to “CURRENT_CITY”, and the time should default to “ANY”. Such issues might be fixed either by having training data that is orders of magnitude larger or by building some specialized support for handling implicit contextual information in various domains.

As part of learning the weights for the component scores we used a shifted variant of the logistic function as our loss function (§5). Fig. 3 shows how the value of the loss changes with iteration index, along with the corresponding number of top ranked benchmarks. It can be seen that as the loss value decreases, the number of top ranked benchmarks increases and vice versa. Thus, these values are negatively correlated, as needed for optimal performance of the gradient descent algorithm, and even though our loss function contains the log-exponential approximation of the max operation it is well behaved for the gradient descent algorithm.



Figure 3: Behavior of Loss Function and Top Ranks (x-axis: iteration index).


% change on using weights learnt for:

| Domain | ATIS | Automata | Text Editing |
|---|---|---|---|
| ATIS | 0.0% | -1.5% | -5.0% |
| Automata | -2.0% | 0.0% | -1.2% |
| Text Editing | -0.6% | -0.8% | 0.0% |

Table 8: Generalization of learning across domains.


Since the weights learned in §4.2 are general purpose, we expect that weights learned from one domain are applicable to other domains, eliminating the time and effort required to re-learn these values on each new domain. The results in Table 8 show that the weight vectors that are learned for one domain perform well when used to rank the results for a new domain. The average decrease in the number of top ranked programs is only 1.9% (with a maximum decrease of 5.0%). For the number of top 3 ranked programs the change is insignificant, with a maximum decrease of less than 0.5%, and thus we do not include them here. This result demonstrates that the learning of the component weights is highly domain independent and generalizes well, allowing it to be reused (or used as a starting point) for new domains.
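
The summary figures quoted above can be recomputed from the off-diagonal entries of Table 8. The dictionary below simply copies those entries; treating the first key as the evaluated domain and the second as the domain the weights were learnt for is an assumption about the table's orientation, and it does not affect the average or the maximum.

```python
# Off-diagonal entries of Table 8, in percent:
# (evaluated domain, domain the weights were learnt for) -> change in top-ranked programs.
TABLE_8 = {
    ("ATIS", "Automata"): -1.5,
    ("ATIS", "Text Editing"): -5.0,
    ("Automata", "ATIS"): -2.0,
    ("Automata", "Text Editing"): -1.2,
    ("Text Editing", "ATIS"): -0.6,
    ("Text Editing", "Automata"): -0.8,
}

decreases = [-v for v in TABLE_8.values()]
average_decrease = sum(decreases) / len(decreases)
maximum_decrease = max(decreases)
print(average_decrease, maximum_decrease)
```

The average comes out at about 1.85%, in line with the 1.9% quoted in the text, and the maximum is the 5.0% entry for ATIS ranked with the Text Editing weights.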

7. Related Work 

PBE/PBD Techniques for Data Manipulation. Programming by demonstration (PBD) systems, which use a trace of a task performed by a user, and programming by example (PBE) systems, which learn from a set of input-output examples, have been used to enable end-user programming for a variety of domains. For PBD these domains include text manipulation [26] and table transformations [21] among others [7]. Recent work on PBE by Gulwani et al. has included domains for manipulating strings [11, 45], numbers [44], and tables [19]. As mentioned earlier, both PBD and PBE based techniques struggle when the desired transformations involve conditional operations. In contrast, the natural language based approach in this work performs well for both simple and conditional operations.

Keyword Programming. Keyword programming refers to the process of translating a set or sequence of keywords into function calls over some API. This API may consist either of operations in an existing programming language [31, 38, 49] or a DSL constructed for a specific class of tasks [30, 32]. Keyword programming techniques use various program synthesis approaches to build expression trees from the elements of the underlying API, similar to the Bag algorithm in §4, and then use simple heuristics, such as words used and keyword-to-terminal weights, to rank the resulting expression trees. These systems have low precision and, as a result, will frequently suggest incorrect programs.

Semantic Parsing. Semantic parsing [36] is a sophisticated means of constructing a program from natural language using a specialized language parser. Several approaches, including syntax directed [23], NLP parse trees [9], SVM driven [22], combinatory categorial grammars [25, 50, 51], and dependency-based semantics [28, 39], have been proposed. These systems have high precision, usually suggesting the correct program, but low recall and often do not return any suggestions at all. In contrast, the technique in this paper achieves similar levels of precision but does not suffer from low recall.

Natural Language Based Programming. A number of natural language programming systems have been built around grammars, NLC [4], or templates, NaturalJava [42], which impose various constraints on input expressions. Such systems are sensitive to grammatical errors or extraneous words. There has been extensive research on developing natural language interfaces to databases (NLIDB) [2, 37]. While early systems were based on pattern matching the user’s input to one of the known patterns, PRECISE [40, 41] translates semantically tractable NL questions into corresponding SQL queries. However, these systems depend heavily on the underlying data having a known schema, which makes them impractical when the underlying data structure is unknown (or non-existent) as in the text-editing domain used in this work.

SmartSynth [27] is a system for synthesizing smartphone 
scripts from NL. The synthesis technique in SmartSynth is 
highly specialized to the underlying smartphone domain and 
uses a simple the ranking strategy for the programs that 
it produces. Similarly, the NLyze [15] system synthesizes 
spreadsheet formulas from NL. Again, NLyze is designed 
for a specific domain (spreadsheet formula) and uses a rela¬ 
tively simple ranking system consisting of only the equiv¬ 
alent of the coverage, mapping, and overlap features pre¬ 
sented in our paper. In contrast, the system presented in this 
paper is agnostic to the specihcs of the target DSL, the rank¬ 
ing features are independent of the underlying DSL, and we 
automatically learn appropriate weights for the features. In 
addition, as the experimental results in Table 6 demonstrate, 
the use of a simpler ranking system, as in SmartSynth or 
NLyze, results in substantial reductions in recall/precision. 
Thus, the approach in this paper can be seen as an improvement
and generalization of these previous systems.
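
To illustrate the contrast drawn above, the following minimal Python sketch ranks candidate programs by a weighted sum of feature scores. It is not the implementation evaluated in this paper: the feature names are taken from the text, but the candidate programs, feature values, and weights are invented placeholders, and the feature computations themselves are assumed to be given.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    program: str                # candidate DSL expression (placeholder string)
    features: Dict[str, float]  # feature name -> score, assumed precomputed


def score(candidate: Candidate, weights: Dict[str, float]) -> float:
    # Weighted sum of feature scores; features without a weight contribute 0.
    return sum(weights.get(name, 0.0) * value
               for name, value in candidate.features.items())


def rank(candidates: List[Candidate], weights: Dict[str, float]) -> List[Candidate]:
    # Highest-scoring candidate first.
    return sorted(candidates, key=lambda c: score(c, weights), reverse=True)


# Hypothetical candidates and feature values, for illustration only.
candidates = [
    Candidate("program_a", {"coverage": 0.9, "mapping": 0.5, "overlap": 0.1}),
    Candidate("program_b", {"coverage": 0.6, "mapping": 0.9, "overlap": 0.0}),
]

# A "simple" ranking that looks at a single feature.
coverage_only = {"coverage": 1.0}

# Stand-in for learned weights; these numbers are made up, not learned.
weighted = {"coverage": 1.0, "mapping": 2.0, "overlap": 0.5}

print([c.program for c in rank(candidates, coverage_only)])
print([c.program for c in rank(candidates, weighted)])
```

With these made-up numbers the two weightings order the candidates differently, which is the kind of effect that a richer, learned ranking function is meant to capture.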


8. Conclusion 

Today, billions of end-users have access to computational devices,
yet lack the programming knowledge to effectively
interact with these devices. Most of these users want
to write small programs or specifications that can be described
succinctly in some appropriate domain-specific language
that provides the right level of abstraction. These
users are stuck because of the need to provide step-by-step,
detailed, and syntactically correct instructions to the computer.
Program synthesis has the potential to revolutionize
this landscape, when targeted for the right set of problems 
and using the right interaction model. 

Programming-by-example has been shown to be a very
effective tool, a recent instance being the release of the Flash
Fill feature [11] as part of Microsoft Excel 2013 to rave
reviews [13]. We observe that there are several domains for
which examples are not a natural form of specification (or
where too many examples would be required), but those 
tasks can be easily expressed in a natural language. 

We presented a novel technology for synthesizing programs
from natural language descriptions (based on generating
and ranking programs from a set of terminals that correspond
to the words in the natural language description).
More significantly, we showed how this framework allows
creating synthesizers for different DSLs by simply providing
examples of translations.

We believe that the technique will work with off-the-shelf
DSLs without major modifications, provided the DSL is
functional without binding constructs such as temporary
variables or quantifiers and has a level of abstraction with
a direct correspondence to the abstraction being used in the
natural language. For example, our approach will not work
well for translating descriptions for automata construction
problems into a target DSL of regular expressions because
there is no direct correspondence. This lack of correspondence
between the source language and the target DSL requires
the translator to use non-trivial logical reasoning during
the conversion and greatly reduces the effectiveness of
our system. 

As with any program synthesis technique that fundamentally
involves search over exponential spaces, the cost of
our technique is also worst-case exponential in the size of
the DSL. However, the key issue is doing this efficiently for
practical cases. Our synthesis works efficiently (usually under
1 second) for a range of useful DSLs. The size of the dictionary
has minimal impact on the runtime as the translation
only depends on the subset of the dictionary corresponding 
to the words in the input sentence. 
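
The observation about dictionary size can be sketched as follows. Assuming, hypothetically, that the dictionary is stored as a hash map from words to DSL terminals, only the entries for words in the input sentence are ever touched, so the lookup cost grows with the sentence length rather than with the dictionary size. The dictionary entries, terminal names, and whitespace tokenization below are invented for illustration and are not those of the DSLs used in this paper.

```python
from typing import Dict, List


def relevant_terminals(sentence: str,
                       dictionary: Dict[str, List[str]]) -> Dict[str, List[str]]:
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    words = sentence.lower().split()
    # One hash lookup per input word; entries for other words are never visited.
    return {w: dictionary[w] for w in words if w in dictionary}


# Hypothetical dictionary: word -> candidate DSL terminals.
dictionary = {
    "delete": ["Remove"],
    "remove": ["Remove"],
    "word": ["WordToken"],
    "line": ["LineScope"],
    "number": ["NumberToken"],
}

print(relevant_terminals("Delete every word", dictionary))
```

Only the returned subset would then be passed on to the generation and ranking steps, however many other entries the dictionary contains.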

In the future, we aim to further generalize the framework
to allow synthesis of synthesizers for a wider variety of 
domains. 